Batch processing is a foundational computing paradigm in big data systems, designed to handle massive datasets with high throughput and strong accuracy guarantees. As one of the core components of big data computing, it plays a critical role in transforming raw data into structured, analyzable results.
In the previous article, we explored how HDFS became the cornerstone of big data storage through its distributed, scalable, and fault-tolerant design. Building on that foundation, this article focuses on the computing layer.
In practice, big data computing mainly consists of batch processing and real-time processing.
In this article, we focus specifically on batch processing, examining its core principles, architecture, mainstream frameworks, and real-world application scenarios.
What Is Batch Processing?
Batch processing is a big data computing model in which systems collect data over a defined period, process it in bulk, and generate results in a single execution cycle.
As a result, this model fits scenarios that involve large data volumes, complex computation logic, and high tolerance for latency, such as daily or monthly reporting.
At its core, batch processing follows three clear steps, illustrated in the sketch after this list:
- Collect input data into batches
- Execute parallel computation across distributed nodes
- Aggregate intermediate results into a final output
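To make these steps concrete, here is a minimal PySpark sketch; the input path, schema, and output path are hypothetical, chosen only for illustration.

```python
# Minimal PySpark sketch of the three batch steps.
# The input path, schema, and output path are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch-sketch").getOrCreate()

# Step 1: collect input data into a batch (here, one day's worth of files).
views = spark.read.csv("hdfs:///logs/views/2024-01-01/", header=True)

# Step 2: execute parallel computation across distributed nodes;
# each partition is filtered independently.
valid = views.filter(F.col("url").isNotNull())

# Step 3: aggregate intermediate results into a final output.
daily_counts = valid.groupBy("url").agg(F.count("*").alias("views"))
daily_counts.write.mode("overwrite").parquet("hdfs:///warehouse/daily_views/")
```

Each run processes a complete, immutable batch, so rerunning the job over the same input produces the same output.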
Consequently, batch processing exhibits several defining characteristics:
- Data immutability – Engineers treat input data as read-only once collection completes.
- Large-scale computation – Systems often process data accumulated over days, months, or even years.
- Scheduled execution – Jobs run at predefined intervals rather than continuously (see the scheduler sketch after this list).
- High accuracy – Complete datasets enable precise and reproducible results.
- High throughput – Distributed execution fully utilizes cluster resources.
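Scheduled execution is usually handled by an external scheduler rather than by the job itself. Below is a minimal sketch using Apache Airflow (2.4+, where the `schedule` parameter is available); the DAG id, cron expression, and submitted script are assumptions for illustration.

```python
# A minimal Airflow DAG that triggers a batch job once per day at 02:00.
# The DAG id, schedule, and submitted script are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_batch_job",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # predefined interval, not continuous execution
    catchup=False,
) as dag:
    run_batch = BashOperator(
        task_id="run_batch",
        bash_command="spark-submit /jobs/daily_report.py",
    )
```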
Batch Processing Architecture
Following its core principles, a batch processing system typically consists of several layers, wired together in the sketch after this list:
- Data Source – The origin of the data, usually generated by business systems; may include database records or event tracking logs.
- Ingestion Layer & Storage – Transfers data from sources into the batch processing system for storage, often through storage-specific SDKs or APIs. Common storage media include HDFS and Amazon S3.
- Compute Engine – The core component that loads, filters, and transforms data.
- Resource Management & Scheduling – Allocates cluster resources and coordinates job execution. In most production environments, YARN performs this role.
- Output Layer & Storage – Writes computation results to a target storage system via SDK or API, similar to the ingestion process.
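Here is a minimal sketch of how these layers meet in code, assuming a YARN-managed cluster with the S3 connector configured; the paths, bucket name, and executor count are hypothetical.

```python
# Wiring the layers together: ingest from HDFS, compute on YARN, output to S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ingest-transform-output")
    .master("yarn")                            # resource management: YARN
    .config("spark.executor.instances", "4")   # hypothetical sizing
    .getOrCreate()
)

# Ingestion layer & storage: load raw events from HDFS.
raw = spark.read.json("hdfs:///ingest/events/")

# Compute engine: filter and transform the data.
purchases = raw.filter(raw["event_type"] == "purchase")

# Output layer & storage: write results to Amazon S3.
purchases.write.mode("overwrite").parquet("s3a://analytics-bucket/purchases/")
```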
Batch Processing Frameworks
Two widely used batch processing frameworks are MapReduce and Spark.
- Hadoop MapReduce – The open-source implementation of Google's MapReduce model, and one of the earliest distributed batch processing frameworks. It consists of two core phases: Map and Reduce.
- Spark – Evolves the MapReduce model by introducing in-memory computing and DAG (Directed Acyclic Graph) task scheduling, enabling more complex computations.
The MapReduce programming model splits jobs into two main phases, shown in the word-count sketch after this list:
- Map – Transforms input data into a set of key/value pairs.
- Reduce – Aggregates values associated with the same key and outputs the final result.
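To make the two phases concrete, here is the classic word-count example written in the Hadoop Streaming style, where the framework pipes data through scripts over stdin/stdout and sorts the mapper's output by key before the reducer sees it. This is a minimal sketch, not production code.

```python
# mapper.py – the Map phase: read raw text from stdin and
# emit one key/value pair per word, separated by a tab.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

And the matching reducer:

```python
# reducer.py – the Reduce phase: input arrives sorted by key,
# so all counts for the same word are adjacent and can be
# summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

You can test the pair locally with a shell pipeline such as `cat input.txt | python mapper.py | sort | python reducer.py`.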
However, complex workloads often require chaining multiple MapReduce jobs.
As a consequence, each stage writes its results to disk, and the next stage reloads them. This design increases I/O overhead and prolongs execution time.
A Spark job executes in the following flow (see the sketch after this list):
- Submit the job via Spark SDK or command line
- Construct a DAG from transformation operators
- Split the DAG into stages and schedule tasks
- Shuffle data only when global aggregation or joins require it
- Write final results to target storage systems
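Here is a minimal word-count sketch of this flow in PySpark; the input and output paths are hypothetical. The narrow transformations fuse into one stage, while the shuffle introduced by reduceByKey starts a new one.

```python
# The action at the end triggers Spark to build and execute the DAG.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///input/text/")            # read input partitions
words = lines.flatMap(lambda l: l.split())            # narrow: same stage
pairs = words.map(lambda w: (w, 1))                   # narrow: still no shuffle
counts = pairs.reduceByKey(lambda a, b: a + b)        # wide: shuffle, new stage
counts.saveAsTextFile("hdfs:///output/word_counts/")  # action runs the DAG
```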
Application Scenarios
Batch processing still holds a critical position in enterprise data architectures. Common use cases include:
- Offline Data Warehouse Construction – Using Hive to clean and aggregate raw data into wide tables for business analytics (a sketch follows this list).
- Offline Data Analysis – Performing statistical analysis across various dimensions in a data warehouse.
- Machine Learning Model Training – Processing historical data in bulk to produce datasets for model training.
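As an illustration of the warehouse-construction case, here is a sketch using Spark's Hive integration; the databases (ods, dw), tables, and columns are hypothetical.

```python
# Clean and aggregate raw order events into a wide table for analytics.
# All database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("build-wide-table")
    .enableHiveSupport()   # lets Spark read and write Hive tables
    .getOrCreate()
)

spark.sql("""
    INSERT OVERWRITE TABLE dw.orders_wide
    SELECT o.order_id,
           o.user_id,
           u.region,
           SUM(o.amount) AS total_amount
    FROM ods.orders o
    JOIN ods.users u ON o.user_id = u.user_id
    WHERE o.amount > 0              -- basic cleaning: drop invalid rows
    GROUP BY o.order_id, o.user_id, u.region
""")
```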
Limitations & Challenges
Despite its importance, batch processing faces several challenges:
- High Latency – Results are available only after a full batch completes, making it unsuitable for low-latency use cases.
- High Resource Usage – Large datasets require significant computation resources.
- Data Skew – A few hot keys can account for a disproportionate share of the data, leaving some tasks far heavier than others (one common mitigation is sketched after this list).
- High Maintenance Costs – Failed jobs often require a full rerun, consuming time and resources.
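For the data-skew problem, one widely used mitigation is key salting: split each hot key into several sub-keys, pre-aggregate, then merge the sub-totals. The sketch below uses hypothetical data to show the idea.

```python
# Key salting in PySpark: two-phase aggregation over a skewed key.
# The input data, column names, and bucket count are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

# Hypothetical skewed input: most rows share the same hot key.
events = spark.createDataFrame(
    [("hot", 1)] * 1000 + [("cold", 1)] * 10, ["key", "value"]
)

N = 8  # number of salt buckets; tune to the degree of skew

# Phase 1: spread each key across N sub-keys so no single task
# receives all rows for the hot key, then pre-aggregate.
salted = events.withColumn("salt", (F.rand() * N).cast("int"))
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))

# Phase 2: a much smaller second aggregation merges the sub-totals.
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))
result.show()
```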
Conclusion
Within the big data ecosystem, batch processing forms the backbone of offline analytics and large-scale computation. It effectively connects distributed storage systems with downstream analytical applications.
However, modern businesses increasingly demand low-latency insights.
Therefore, while batch processing remains indispensable, it now works alongside real-time computing rather than replacing it.
In the next article, we will explore real-time processing, explaining how it complements batch processing and addresses latency-sensitive business requirements.